Markov chain Monte Carlo algorithms play a key role in the
Bayesian approach to phylogenetic inference. In this paper, we present 
the first theoretical work analyzing the rate of convergence of several 
Markov chains widely used in phylogenetic inference. We analyze sim- 
ple, realistic examples where these Markov chains fail to converge 
quickly. In particular, the data studied are generated from a pair 
of trees, under a standard evolutionary model. We prove that many 
of the popular Markov chains take exponentially long to reach their 
stationary distribution. Our construction is pertinent since it is well 
known that phylogenetic trees for genes may differ within a single or- 
ganism. Our results shed a cautionary light on phylogenetic analysis 
using Bayesian inference and highlight future directions for potential 
theoretical work. 

1. Introduction. Bayesian inference of phylogeny has had a significant 
impact on evolutionary biology [14]. There is now a large collection of 
popular algorithms for Bayesian inference, including programs such as Mr- 
Bayes [13], BAMBE [22] and PAML [21, 25]. All of these algorithms rely on 
Markov chain Monte Carlo methods to sample from the posterior probability 
of a tree given the data. In particular, they design a Markov chain whose sta- 
tionary distribution is the desired posterior distribution, computed using the 
likelihood and the priors. Hence, the running time of the algorithm depends 
on the convergence rate of the Markov chain to its stationary distribution. 

Therefore, reliable phylogenetic estimates depend on the Markov chains 
reaching their stationary distribution before the phylogeny is inferred. A
variety of schemes (such as multiple starting points [12]) and increasingly
sophisticated algorithms (such as Metropolis-coupled Markov chain Monte
Carlo in MrBayes [13]) are heuristically used to ensure that the chains
converge quickly to their stationary distribution. However, there is no
theoretical understanding of the circumstances under which the Markov chains
will converge quickly or slowly. Thus, there is a critical need for theoretical work
to guide the multitude of phylogenetic studies using Bayesian inference. 

We consider a setting where the data are generated at random, under a 
standard evolutionary model, from the mixture of two tree topologies. Such 
a setting is extremely relevant to real-life data sets. A simple example is
molecular data consisting of DNA sequences for more than one gene. It is 
well known that phylogenetic trees can vary between genes (see [11] for an 
introduction). 

We prove that in the above setting, many of the popular Markov chains 
take extremely long to reach their stationary distribution. In particular, the 
convergence time is exponentially long in the number of characters of the 
data set (a character is a sample from the distribution on the pair of trees). 
This paper appears to be the first theoretical work analyzing the conver- 
gence rates of Markov chains for Bayesian inference. Previously, Diaconis 
and Holmes [6] analyzed a Markov chain whose stationary distribution is 
uniformly distributed over all trees, which corresponds to the case with no 
data. 

Our work provides a cautionary tale for Bayesian inference of phylogenies 
and suggests that if the data contain more than one phylogeny, then great
caution should be used before reporting the results from Bayesian inference 
of the phylogeny. Our results clearly identify further theoretical work that 
would be of great interest. We discuss possible directions in Section 3. 

The complicated geometry of "tree space" poses highly nontrivial difficul- 
ties in analyzing maximum likelihood methods on phylogenetic trees, even 
for constant tree sizes. 

Initial attempts at studying tree space include work by Chor et al. [3] 
which constructs several examples where multiple local maxima for likeli- 
hood occur. Their examples use nonrandom data sets (i.e., not generated 
from any model) on four taxa, and the multiple optima occur on a
specific tree topology, differing only in the branch lengths. 

A different line of work beginning with Yang [24] analytically determines 
the maximum likelihood over rooted trees on three species and binary char- 
acters. Since then, some sophisticated tools from algebraic geometry have 
been used to study the likelihood function and other polynomials on tree 
space (see, e.g., [7, 23]). It appears that the main result on tree spaces 
needed in this paper does not follow directly from the algebraic geometry 
methodology.

Note 1. The results proved here were presented to a wide scientific
audience in a short report published in [20]. 

1.1. Formal statement of results for binary model. We present the formal 
definitions of the various notions and then precisely state our results. 

Let $\Omega$ denote the set of all phylogenetic trees for n species.
Combinatorially, $\Omega$ is the set of (unrooted) trees T = (V, E) with
internal degree 3 and n leaves.

The likelihood of a data set for a tree is defined as the probability that the 
tree generates the data set, under a given evolutionary model. For simplicity, 
we first discuss our results for one of the simplest evolutionary models, known 
as the Cavender-Farris-Neyman (CFN) model [2, 8, 19], which uses a binary 
alphabet. Our results extend to the Jukes-Cantor model with a four-state 
alphabet and to many other mutation models, as discussed later. 

For a tree $T \in \Omega$, let $V_{ext}$ denote the leaves, $V_{int}$ denote the
internal vertices, E denote the edge set and $p : E \to [0, 1/2]$ denote the
edge mutation probabilities. The data are a collection of binary assignments
to the leaves. Under the CFN model, the probability of an assignment
$D : V_{ext} \to \{0, 1\}$ is

$$\Pr(D \mid T, p) = \frac{1}{2} \sum_{\substack{D' \in \{0,1\}^{V}: \\ D'(V_{ext}) = D(V_{ext})}} \; \prod_{\substack{e=(u,v) \in E(T): \\ D'(u) = D'(v)}} (1 - p(e)) \prod_{\substack{e=(u,v) \in E(T): \\ D'(u) \neq D'(v)}} p(e).$$

Below, we will further assume that the marginal distribution at any node of 
the tree is given by the uniform distribution on {0, 1}. 
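
As a concrete illustration of the formula above, the following minimal Python sketch evaluates the CFN probability of a leaf assignment by brute-force summation over the internal vertices. The function name `cfn_probability`, the edge-list encoding and the internal vertex labels `x`, `y`, `z` are illustrative assumptions, not notation from the paper; the example tree has the shape ((12), 3), (45).

```python
from itertools import product

def cfn_probability(edges, internal, data):
    """Probability of the leaf assignment `data` under the CFN model.

    edges    -- list of (u, v, p), with p the mutation probability of edge (u, v)
    internal -- list of internal vertex labels (hypothetical names)
    data     -- dict mapping each leaf to 0 or 1
    """
    total = 0.0
    # Sum over every extension D' of the leaf assignment to the internal vertices.
    for values in product((0, 1), repeat=len(internal)):
        full = dict(data)
        full.update(zip(internal, values))
        term = 1.0
        for u, v, p in edges:
            term *= (1.0 - p) if full[u] == full[v] else p
        total += term
    # The factor 1/2 reflects the uniform marginal at an arbitrary vertex.
    return 0.5 * total

# Five-leaf tree ((1,2),3,(4,5)); the numeric value of a is an arbitrary choice.
a = 0.1
EDGES = [(1, "x", a * a), (2, "x", a * a), ("x", "y", a), (3, "y", a * a),
         ("y", "z", a), (4, "z", a * a), (5, "z", a * a)]
INTERNAL = ["x", "y", "z"]

def all_leaf_assignments():
    """Yield every binary assignment to the leaves 1..5."""
    for values in product((0, 1), repeat=5):
        yield dict(zip(range(1, 6), values))

total_mass = sum(cfn_probability(EDGES, INTERNAL, d) for d in all_leaf_assignments())
```

The probabilities of all 32 leaf assignments sum to one, which is a quick sanity check on the normalization by 1/2. The enumeration is exponential in the number of internal vertices, so it is only meant for very small trees.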

Note that when p(e) is close to zero, the endpoints are likely to receive 
the same assignment, whereas when p(e) is close to 1/2, the endpoints are 
likely to receive independently random assignments. Under the "molecular 
clock" assumption, edge e has length proportional to $-\log_2(1 - 2p(e))$.

An algorithmic method for generating a character D for a tree T with
weights p is first to generate a uniformly random assignment for an arbitrary
vertex v. Then beginning at v, for each edge e = (v, w), given the assignment
to one of the endpoints, the other endpoint receives the same assignment
with probability 1 - p(e) and a different assignment with probability p(e).
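
The generation procedure just described can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the names `build_tree` and `sample_character`, the dictionary encoding of the tree and the vertex labels are hypothetical and chosen only for illustration.

```python
import random

def build_tree(edges):
    """Turn a list of (u, v, p) into an adjacency dict and an edge-probability dict."""
    adjacency, prob = {}, {}
    for u, v, p in edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)
        prob[frozenset((u, v))] = p
    return adjacency, prob

def sample_character(adjacency, prob, start, rng):
    """Sample one binary character on all vertices of the tree (CFN model)."""
    labels = {start: rng.randint(0, 1)}  # uniform assignment at an arbitrary vertex
    stack = [start]
    while stack:
        v = stack.pop()
        for w in adjacency[v]:
            if w in labels:
                continue
            # The other endpoint differs with probability p(e), else it is copied.
            flip = rng.random() < prob[frozenset((v, w))]
            labels[w] = 1 - labels[v] if flip else labels[v]
            stack.append(w)
    return labels

def example_edges(a):
    """Tree ((1,2),3,(4,5)) with terminal probability a**2 and internal probability a."""
    return [(1, "x", a * a), (2, "x", a * a), ("x", "y", a), (3, "y", a * a),
            ("y", "z", a), (4, "z", a * a), (5, "z", a * a)]
```

The observed character is the restriction of the returned labels to the leaves, for example `{v: labels[v] for v in range(1, 6)}`.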

Finally, for a collection of data $D = (D_1, \dots, D_N)$,

$$\Pr(D \mid T, p) = \prod_{i} \Pr(D_i \mid T, p) = \exp\Big( \sum_{i} \log(\Pr(D_i \mid T, p)) \Big).$$



Now, applying Bayes' rule, we can write the posterior probability of a tree
given the data,

$$\Pr(T \mid D) = \frac{\int_p \Pr(D \mid T, p)\,\Psi(T, p)\,dp}{\Pr(D)} = \frac{\int_p \Pr(D \mid T, p)\,\Psi(T, p)\,dp}{\sum_{T'} \int_{p'} \Pr(D \mid T', p')\,\Psi(T', p')\,dp'},$$

where $\Psi(T, p)$ is the prior density on the space of trees, so that

$$\sum_{T} \int_p \Psi(T, p)\,dp = 1.$$



Since the denominator is difficult to compute, Markov chain Monte Carlo is 
used to sample from the above distribution. For an introduction to Markov 
chains in phylogeny, see [9]. 

The algorithms for Bayesian inference differ in their choice of Markov 
chain to sample from the distribution and in their choice of prior. In practice, 
the choice of an appropriate prior is an important concern. Felsenstein [9] 
gives an introduction to many of the possible priors. Rannala and Yang [21] 
introduce a prior based on a birth-death process, whereas Huelsenbeck and 
Ronquist's program MrBayes [13] allows the user to input a prior (using 
either uniform or exponential distributions). Our results hold for all these
popular priors and only require that the priors are so-called
$\varepsilon$-regular for some $\varepsilon > 0$, in the sense that for all T and p,

$$\Psi(T, p) \ge \varepsilon.$$

Each tree $T \in \Omega$ is given a weight,

$$w(T) = \int_p \Pr(D \mid T, p)\,\Psi(T, p)\,dp.$$

Computing the weight of a tree can be done efficiently via dynamic
programming in cases where $\Psi$ admits a simple formula. In other cases,
numerical integration is needed. See [9] for background.

The transitions of the Markov chain $(T_t)$ are defined as follows. From a
tree $T_t \in \Omega$ at time t:

1. Choose a neighboring tree T'. See below for design choices for this step.

2. Set $T_{t+1} = T'$ with probability $\min\{1, w(T')/w(T_t)\}$, otherwise set
$T_{t+1} = T_t$.

This is an example of the standard Metropolis algorithm. The acceptance
probability of $\min\{1, w(T')/w(T_t)\}$ implies that the stationary distribution
$\pi$ is proportional to the weights w, that is, for $T \in \Omega$,

$$\pi(T) = \frac{w(T)}{\sum_{T' \in \Omega} w(T')}.$$


Three natural approaches for connecting the tree space $\Omega$ are
nearest-neighbor interchanges (NNI), subtree pruning and regrafting (SPR) and
Tree-Bisection-Reconnection (TBR). In NNI, one of the n - 3 internal edges is
chosen at random and the four subtrees are reconnected randomly in one
of the three ways; see Figure 1 for an illustration. In SPR, a random edge
is chosen, one of the two subtrees attached to it is removed at random and
reinserted along a random edge in the remaining subtree; see Figure 2. In
TBR, one of the edges of the tree is removed to obtain two trees. Then the
two trees are joined by an edge connecting two midpoints of edges of the
two trees. The SPR moves are a subset of the TBR moves, but the two
move sets are identical when the tree has fewer than six leaves. We refer to
the above chains as Markov chains with discrete state space and NNI, SPR,
TBR transitions, respectively.

Fig. 1. Illustration of NNI transition. An internal edge has four subtrees attached. The
transition reattaches the subtrees randomly. Since the trees are unrooted, there are three
ways of attaching the subtrees, one of which is the same as the original tree.
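
A Metropolis chain of this kind, with a random neighbor proposal followed by the acceptance step, can be sketched as follows. This is a minimal sketch, not the implementation of any of the cited programs: the names `metropolis_step` and `run_chain` are hypothetical, the state space is a toy four-cycle rather than tree space, and the sketch assumes a symmetric proposal (each state proposes each of its neighbors with the same probability), which is what makes the stationary distribution proportional to the weights.

```python
import random

def metropolis_step(state, neighbors, weight, rng):
    """One transition: propose a uniformly random neighbor, then accept or reject."""
    proposal = rng.choice(neighbors(state))
    accept = min(1.0, weight(proposal) / weight(state))
    return proposal if rng.random() < accept else state

def run_chain(start, neighbors, weight, steps, rng):
    """Run the chain and return how often each state was visited."""
    counts = {}
    state = start
    for _ in range(steps):
        state = metropolis_step(state, neighbors, weight, rng)
        counts[state] = counts.get(state, 0) + 1
    return counts

# Toy state space (an assumption for illustration): four states on a cycle.
WEIGHTS = [1.0, 2.0, 3.0, 4.0]

def cycle_neighbors(s):
    return [(s - 1) % 4, (s + 1) % 4]

def cycle_weight(s):
    return WEIGHTS[s]
```

On this toy example the visit frequencies approach `WEIGHTS[s] / sum(WEIGHTS)`. In the phylogenetic setting, `neighbors` would enumerate NNI, SPR or TBR moves and `weight` would be w(T).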

Some Markov chains instead walk on the continuous state space where a
state consists of a tree with an assignment of edge probabilities. Our results
extend to chains with continuous state space where transitions only modify
the tree topology by an NNI, SPR or TBR transition and edge probabilities
are always in (0, 1/2). Some examples of continuous state space chains are
[17] and [4, 16].

Fig. 2. Illustration of SPR transition. The randomly chosen edge is marked by an arrow.
The subtree containing B and C is removed and reattached at the random edge marked by
a starred arrow. The resulting tree is illustrated.

The mixing time $T_{mix}$ of the Markov chain is defined to be the number
of transitions until the chain (from the worst initial state) is within total
variation distance at most 1/4 from the stationary distribution.
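
To make the definition concrete, the sketch below computes the mixing time exactly for a toy three-state Metropolis chain with two heavy states separated by a light one. The chain, the weight ratio `r` and the function names are assumptions made for illustration; they are a caricature of the bottleneck phenomenon studied in this paper, not the tree-space chain itself.

```python
def metropolis_matrix(r):
    """Transition matrix on states 0, 1, 2 with weights (1, r, 1), where 0 < r <= 1.

    Each state proposes each neighbor on the path 0 - 1 - 2 with probability 1/2
    (the end states hold otherwise) and accepts with probability min(1, w'/w).
    """
    w = [1.0, r, 1.0]
    nbrs = {0: [1], 1: [0, 2], 2: [1]}
    P = [[0.0] * 3 for _ in range(3)]
    for x in range(3):
        for y in nbrs[x]:
            P[x][y] = 0.5 * min(1.0, w[y] / w[x])
        P[x][x] = 1.0 - sum(P[x])
    pi = [wi / sum(w) for wi in w]
    return P, pi

def mixing_time(r, max_steps=10**6):
    """Smallest t such that the worst-case total variation distance is <= 1/4."""
    P, pi = metropolis_matrix(r)
    rows = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    for t in range(1, max_steps + 1):
        # rows[i] is the distribution after t steps when started from state i.
        rows = [[sum(row[k] * P[k][j] for k in range(3)) for j in range(3)]
                for row in rows]
        worst = max(0.5 * sum(abs(row[j] - pi[j]) for j in range(3)) for row in rows)
        if worst <= 0.25:
            return t
    raise RuntimeError("chain did not mix within max_steps")
```

In this toy chain the mixing time grows like 1/r, so a weight ratio that is exponentially small in N forces a mixing time that is exponentially large in N.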

We consider data coming from a mixture of two trees, $T_1(a, a^2)$ and
$T_2(a, a^2)$. $T_1$ is given by ((12), 3), (45), while $T_2$ is given by ((15), 3), (24);
see Figure 3. On the trees $T_1(a, a^2)$ and $T_2(a, a^2)$, we have two edge
probabilities, one for those edges incident to the leaves and a different one for
internal edges. We let the probability of edges going to the leaves be $a^2$ and
let the internal edges have probability a, where a will be chosen to be a
sufficiently small constant. The trees $T_1(a, a^2)$ and $T_2(a, a^2)$ will have small
edge probabilities, as commonly occurs in practice.

We let $\mathcal{D}_1$ be the distribution of the data according to $T_1(a, a^2)$ and
$\mathcal{D}_2$ the distribution according to $T_2(a, a^2)$. We let
$\mathcal{D} = 0.5(\mathcal{D}_1 + \mathcal{D}_2)$ and consider a data set consisting of N characters.

We prove the following theorem: 

Theorem 1. For all sufficiently small a > 0, there exists a constant
c > 0 such that for all $\varepsilon > 0$, the following holds. Consider a data set with
N characters, that is, $D = (D_1, \dots, D_N)$, chosen independently from the
distribution $\mathcal{D}$. Consider the Markov chains on tree topologies defined by
nearest-neighbor interchanges or subtree pruning and regrafting. Then with
probability at least $1 - \exp(-cN)$ over the data generated, the mixing time
of the Markov chains, with priors which are $\varepsilon$-regular, satisfies

$$T_{mix} \ge c\,\varepsilon \exp(cN).$$

Note that $\varepsilon$ only has a small effect on this lower bound for the mixing
time.

Remark 2. Our results show that the mixing times of the Markov chains 
are exponentially slow when the data are generated by a mixture of two trees 
and there are n = 5 leaves. One would expect that the same phenomenon 
holds for a much more general class of mixture distributions and for trees 
with larger numbers of leaves, but except for some trivial extensions of our
results, it is generally unknown when this is the case. This suggests the
following open problems:

• Our proof relies on proving that the likelihood function, under the mixture
distribution $\mathcal{D}$, obtains values on trees $T_1$ and $T_2$ that are larger than the
maximal value on any other tree. This raises the following question. For
n = 5, for what branch lengths (i.e., edge probabilities) in the generating
trees $T_1$ and $T_2$, under the resulting mixture distribution, is the likelihood
at any tree whose topology is different from that of $T_1$ and $T_2$ smaller than
the maximal value obtained at $T_1$ and at $T_2$? In such cases, the mixing
time will be exponential in the number of characters.

• More challenging is the case when there are significantly more than five
leaves. For example, we do not know the answer to the following problem.
Suppose that the data are generated from a mixture of trees $T_1'$ and $T_2'$
on n > 5 leaves and that there exists a subset S of five leaves, where the
induced subtree of $T_1'$ on S is $T_1(a, a^2)$ and the induced subtree of $T_2'$ on
S is $T_2(a, a^2)$. Does this imply that the mixing time is exponential in the
sequence length (i.e., number of characters)?

1.2. General mutation models. As mentioned above, our theorem is valid 
for many of the mutation models discussed in the literature. We now define 
these models and derive some elementary features of them that will be used 
below. In the general case, it is easier to define the evolution model on rooted 
trees. However, since we will only discuss reversible models, the trees may 
be rooted arbitrarily. Moreover, for general models, we consider rooted trees 
with edge lengths, as opposed to unrooted trees with edge probabilities. 

The mutation models are defined on a finite character set A of size q.
We will denote the letters of this alphabet by $\alpha$, $\beta$, and so on. The mutation
model is given by a $q \times q$ mutation rate matrix Q that is common to all edges
of the tree, along with a vector $(l(e))_{e \in E(T)}$ of edge lengths. The mutation
along edge e is given by

$$\exp(l(e)Q) = I + l(e)Q + \frac{l(e)^2 Q^2}{2!} + \frac{l(e)^3 Q^3}{3!} + \cdots.$$

Thus, the probability of an assignment $D : V_{ext} \to A$ is

$$\Pr(D \mid T, l) = \sum_{\substack{D' \in A^{V}: \\ D'(V_{ext}) = D(V_{ext})}} \pi_{D'(r)} \prod_{e=(u,v) \in E(T)} [\exp(l(e)Q)]_{D'(u), D'(v)},$$

where all the edges (u, v) are assumed to be directed away from the root r
and where $\pi$ denotes the initial distribution at the root.

It is well known that the CFN model is a special case of the model above 
where Q= (" 1 _J) and p(e) = (1 - exp(-l(e)))/2. 

We will further make the following assumptions:

Assumption 1. 1. The Markov semigroup $(\exp(tQ))_{t \ge 0}$ has a unique
stationary distribution given by $\pi$ such that $\pi_\alpha > 0$ for all $\alpha$. Moreover, the
semigroup is reversible with respect to $\pi$, that is,
$\pi_\alpha Q_{\alpha,\beta} = \pi_\beta Q_{\beta,\alpha}$ for all $\alpha$ and $\beta$.

2. The character at the root has marginal distribution $\pi$. By the reversibility
of the semigroup, it then follows that the marginal distribution at every
node is $\pi$.

3. The rate of transitions from a state is the same for all states. More
formally, there exists a number q such that for all $\alpha$,

$$\sum_{\beta \neq \alpha} Q_{\alpha,\beta} = -Q_{\alpha,\alpha} = q. \tag{1}$$

In fact, by rescaling the edge length of all edges, we may assume without
loss of generality that $Q_{\alpha,\alpha} = -1$ for all $\alpha$.

Remark 3. Parts 1 and 2 of the assumption together imply that we 
obtain the same model for all possible rootings of any specific tree. Thus, 
the model is, in fact, defined on unrooted trees. 

Remark 4. It is straightforward to check that our assumptions include 
as special cases the CFN model, the Jukes-Cantor model, Kimura's two- 
parameter model and many other models. See [18] for an introduction to 
the various evolutionary models. 

1.3. Statement of the general theorem. 

Definition 5. Let $\mathcal{T}$ be the space of all trees and edge lengths on five
leaves. We say that a prior density $\Psi$ on $\mathcal{T}$ is $(\varepsilon, a)$-regular if for every T
and l where $l(e) \le 2a$ for all e, it holds that $\Psi(T, l) \ge \varepsilon$.

Remark 6. For all a > 0, all of the priors used in the literature are
$(\varepsilon, a)$-regular for an appropriate value of $\varepsilon = \varepsilon(a)$.

Theorem 7. Let Q be a mutation rate matrix that satisfies Assumption 1.
Then there exist an a > 0, constants c > 0, $\eta > 0$, two trees $T_1, T_2$
and open sets $S_1 \subset (0, \infty)^{E(T_1)}$, $S_2 \subset (0, \infty)^{E(T_2)}$ such that if
$l_1 \in S_1$, $l_2 \in S_2$, then for the distribution $\mathcal{D}_1$ generated at the leaves of
$(T_1, l_1)$ and the distribution $\mathcal{D}_2$ generated at the leaves of $(T_2, l_2)$, the
following holds for $\mathcal{D} = (0.5 - p)\mathcal{D}_1 + (0.5 + p)\mathcal{D}_2$ for $|p| < \eta$ and
all $\varepsilon > 0$:

Let $D = (D_1, \dots, D_N)$ be chosen independently from the distribution $\mathcal{D}$.
Consider a Markov chain on discrete or continuous tree space with only
NNI, SPR or TBR transitions. Then with probability $1 - \exp(-cN)$ over
the data generated, the mixing time of the Markov chain, for priors which
are $(\varepsilon, a)$-regular, satisfies

$$T_{mix} \ge c\,\varepsilon \exp(cN).$$

Remark 8. It is straightforward to check that Theorem 1 is a special 
case of Theorem 7. This follows by the standard translation between edge 
lengths and edge probabilities. As mentioned above, the CFN model (as well 
as Jukes-Cantor and many other models) satisfies Assumption 1. 

2. Proof of the general theorem. We begin with the definition of the
mixture distribution $\mathcal{D}$ for general models. Let $\mathcal{D}_1$ be the distribution at the
leaves of the evolutionary model defined on the tree $T_1(a, a^2)$ and $\mathcal{D}_2$ be
the distribution at the leaves of the evolutionary model defined on the tree
$T_2(a, a^2)$. We let $\mathcal{D} = (0.5 - p)\mathcal{D}_1 + (0.5 + p)\mathcal{D}_2$.

We first expand the distribution $\mathcal{D}$. This is easily done in terms of $C^*$,
where $C^*$ is the set of cherries in $T_1 \cup T_2$. We will use the following definition
of a cherry:

Definition 9. Let T be a tree. We say that a pair of leaves i, j
is a cherry of T if there exists a single edge e of T such that removing e
disconnects i and j from the other leaves of T. For a tree T, we let C(T)
denote the set of cherries of T.

Note that according to this definition, the "star" tree has no cherries. We
clearly have $C^* = C(T_1) \cup C(T_2) = \{(12), (15), (45), (24)\}$.
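
The definition of a cherry translates directly into a small computation: remove each edge in turn and check whether one side contains exactly two leaves. The sketch below does this for the two five-leaf trees; the function name `cherries` and the internal vertex labels `x`, `y`, `z`, `c` are hypothetical and used only for illustration.

```python
def cherries(edges, leaves):
    """Return the set of cherries of a tree given as a list of (u, v) edges."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    leaves = set(leaves)
    found = set()
    for a, b in edges:
        # Collect the vertices on a's side after removing the edge (a, b).
        seen, stack = {a}, [a]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if (x, y) in ((a, b), (b, a)) or y in seen:
                    continue
                seen.add(y)
                stack.append(y)
        # A side with exactly two leaves is a cherry separated by this edge.
        for side in (seen & leaves, leaves - seen):
            if len(side) == 2:
                found.add(tuple(sorted(side)))
    return found

LEAVES = [1, 2, 3, 4, 5]
T1 = [(1, "x"), (2, "x"), ("x", "y"), (3, "y"), ("y", "z"), (4, "z"), (5, "z")]
T2 = [(1, "x"), (5, "x"), ("x", "y"), (3, "y"), ("y", "z"), (2, "z"), (4, "z")]
STAR = [(i, "c") for i in LEAVES]

c_star = cherries(T1, LEAVES) | cherries(T2, LEAVES)
```

Here `c_star` contains the four pairs (1, 2), (1, 5), (4, 5) and (2, 4), and the five-leaf star tree yields the empty set.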

Our theorem holds for a sufficiently small. Hence, the asymptotic
notation in our proofs is in terms of $1/a \to \infty$. Thus, $a = o(-a \log a)$ and
$a^2 = o(-a \log a)$, since $-\log a \to \infty$ as $a \to 0$.

It is easy to estimate $\mathcal{D}$ for small a. This follows from the following lemma:

Lemma 10. For an edge e of length b, conditioned on the character at
the endpoint of the edge, the probability that the other endpoint obtains the
same label is $1 - b + O(b^2)$. The probability that it obtains a different label
is $b + O(b^2)$.

Proof. Part 3 of Assumption 1, along with the expansion of $\exp(bQ)$,
implies that

$$\exp(bQ) = I + bQ + O(b^2).$$

The lemma now follows from the fact that $\sum_{\beta \neq \alpha} Q_{\alpha,\beta} = 1$, which was stated
as part 3 of Assumption 1. □
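
Lemma 10 is easy to check numerically. The sketch below evaluates $\exp(bQ)$ by a truncated Taylor series in pure Python and compares the diagonal entry with 1 - b. The helper `mat_exp` and the two rate matrices (a two-state matrix and a four-state Jukes-Cantor-type matrix, both normalized so that every off-diagonal row sum equals 1) are assumptions chosen for illustration.

```python
def mat_exp(Q, b, terms=30):
    """Approximate exp(bQ) by the first `terms` terms of its Taylor series."""
    n = len(Q)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        # term <- term * (bQ) / k, so that term equals (bQ)^k / k!
        term = [[sum(term[i][m] * b * Q[m][j] for m in range(n)) / k
                 for j in range(n)] for i in range(n)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

# Two-state rate matrix with total rate 1 out of each state (assumed normalization).
Q_BINARY = [[-1.0, 1.0], [1.0, -1.0]]
# Four-state matrix with rate 1/3 to each other state, total rate 1.
Q_FOUR = [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]

b = 0.01
P_binary = mat_exp(Q_BINARY, b)
P_four = mat_exp(Q_FOUR, b)
same_binary = P_binary[0][0]       # probability of keeping the label
diff_four = sum(P_four[0][1:])     # probability of changing the label
```

For b = 0.01 both matrices give a probability of keeping the label that differs from 1 - b by an amount of order $b^2$, as the lemma predicts.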
We begin by compiling some simple facts concerning the mixture
distribution $\mathcal{D}$. We will use the following notation for characters. By $\alpha$, we denote
the character that is constant $\alpha$. We let F denote the set of all constant
characters. By $(\alpha, i, \beta)$, we denote the character that is $\alpha$ on all leaves except
i, where it is $\beta$. The set of all such characters is denoted by $F_i$. By $(\alpha, i, j, \beta)$,
we denote the character that is $\beta$ on i, j and $\alpha$ on all other leaves. The set
of all such characters is denoted by $F_{i,j}$. We denote by G the set of all other
characters.

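Lemma 11. The mixture distribution $\mathcal{D}$ satisfies the following
conditions:

• for all $\alpha \in F$,

$$\mathcal{D}[\alpha] = \pi_\alpha (1 - 2a + O(a^2)); \tag{2}$$

• for all i,

$$\mathcal{D}[F_i] = O(a^2); \tag{3}$$

• for all $(i, j) \in C^*$,

$$\mathcal{D}[F_{i,j}] = a/2 + O(a^2); \tag{4}$$

• for $(i, j) \notin C^*$,

$$\mathcal{D}[F_{i,j}] = O(a^2); \tag{5}$$

• finally,

$$\mathcal{D}[G] = O(a^2). \tag{6}$$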



Proof. We first claim that it suffices to assume that the mixture is
generated from $T_1(a, a^2)$ and $T_2(a, a^2)$ with equal weights (p = 0). Since
$\mathcal{D}[\sigma]$ is continuous in the edge lengths and the mixture probabilities, proving
the bounds (2)-(6) for this mixture implies the same bounds for $(T_1, l_1)$ and
$(T_2, l_2)$, as long as $l_i(e)$ is close to a for internal edges and close to $a^2$ for
terminal edges. Similarly, the same bounds will hold for mixture probabilities
0.5 - p and 0.5 + p, as long as p is sufficiently small.

Hereafter, a terminal edge refers to an edge connected to a leaf. Consider
first the probability that $D \notin F$. There are two ways in which this can occur:
either there is a mutation on exactly one of the internal edges and no mutations
on the terminal edges, which occurs with probability
$2a(1 - a) + O(a^2) = 2a + O(a^2)$, or there is a mutation on a terminal edge and/or
both internal edges, these occurring with probability $O(a^2)$ by our choice of
edge lengths on $T_1$ and $T_2$. This implies (2).

For $D \in F_i$, there needs to be a mutation on at least one terminal edge,
which implies (3). Next, consider a cherry $(i, j) \in C^*$, say
$(i, j) = (1, 2) \in C(T_1)$. To generate $D \in F_{i,j}$, the dominant term arises when
we generate from $T_1$ and have a mutation on the internal edge to the parent of
leaves 1 and 2; the alternative under $T_1$ is a mutation on more than one
terminal branch. If we generate from $T_2$, we need a mutation on more than one
terminal branch. We thus obtain (4). It is easy to see that in order to generate
$D \in F_{i,j}$ where $(i, j) \notin C^*$ or $D \in G$, two mutations are needed. This implies
(5) and (6).

□
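
The estimates of Lemma 11 can be checked numerically in the binary (CFN) case. The sketch below is an illustration under stated assumptions: it uses edge probabilities a on internal edges and $a^2$ on terminal edges, as in Section 1.1, an equal-weight mixture, and hypothetical helper names (`cfn_prob`, `caterpillar`, `mixture`, `mass_pair`).

```python
from itertools import product

def cfn_prob(edges, data):
    """CFN probability of a leaf assignment; internal vertices are 'x', 'y', 'z'."""
    total = 0.0
    for values in product((0, 1), repeat=3):
        full = dict(data)
        full.update(zip("xyz", values))
        term = 1.0
        for u, v, p in edges:
            term *= (1.0 - p) if full[u] == full[v] else p
        total += term
    return 0.5 * total

def caterpillar(cherry1, middle, cherry2, a):
    """Tree ((cherry1), middle, (cherry2)): terminal edges a**2, internal edges a."""
    (i, j), (k, l) = cherry1, cherry2
    return [(i, "x", a * a), (j, "x", a * a), ("x", "y", a), (middle, "y", a * a),
            ("y", "z", a), (k, "z", a * a), (l, "z", a * a)]

def mixture(a):
    """Equal-weight mixture of T1 = ((12),3),(45) and T2 = ((15),3),(24)."""
    t1 = caterpillar((1, 2), 3, (4, 5), a)
    t2 = caterpillar((1, 5), 3, (2, 4), a)
    dist = {}
    for values in product((0, 1), repeat=5):
        data = dict(zip(range(1, 6), values))
        dist[values] = 0.5 * cfn_prob(t1, data) + 0.5 * cfn_prob(t2, data)
    return dist

def mass_pair(dist, i, j):
    """Total mass of characters where leaves i, j agree and differ from all others."""
    total = 0.0
    for values, pr in dist.items():
        inside = {values[i - 1], values[j - 1]}
        outside = {values[k] for k in range(5) if k + 1 not in (i, j)}
        if len(inside) == 1 and len(outside) == 1 and inside != outside:
            total += pr
    return total

a = 0.01
dist = mixture(a)
constant_mass = dist[(0,) * 5] + dist[(1,) * 5]
```

With a = 0.01, the mass of the constant characters is close to 1 - 2a, the mass on the cherry (1, 2) is close to a/2, and the mass on a pair that is not in $C^*$, such as (1, 3), is much smaller than $a^2$.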

Remark 12. Note that the argument above for $\alpha \in F$ can be extended
to show the following. Let T be any tree and l be a vector of edge lengths.
Then

$$\mathcal{D}[\alpha] = \pi_\alpha \Big( 1 - \sum_e l(e) \Big) + O\Big( \max_e l(e)^2 \Big).$$

In particular, if a is sufficiently small and $l(e) \le 4a \log(1/a)$ for all e, then

$$\mathcal{D}[\alpha] \le \pi_\alpha$$

for all $\alpha$.

Definition 13. The expected log-likelihood of a tree T with edge lengths
l is defined to be

$$L_{\mathcal{D}}(T, l) = \mathbf{E}_{\sigma \in \mathcal{D}} \log \Pr(\sigma \mid T, l).$$

Let $L_{\mathcal{D}}(T, \ell)$ denote the expected log-likelihood of the tree T with all
edge lengths equal to $\ell$. We will show that tree $T_1$ with all edge lengths a
(including terminal edges) has large likelihood. We denote by $(T_i, a)$ the tree
$T_i$ where all edge lengths are exactly a.

Lemma 14. The tree $T_1$ satisfies

$$L_{\mathcal{D}}(T_1, a) \ge H(\pi) + (1 + o(1))\,3a \log a$$

and a similar inequality is satisfied by $T_2$, where

$$H(\pi) = \sum_{\alpha} \pi_\alpha \log \pi_\alpha.$$

Proof. We prove the result for $T_1$; the proof for $T_2$ is identical. We first
consider the sequences in F. By (2), the $\mathcal{D}$-probability of the sequence $\alpha$ is

$$\mathcal{D}[\alpha] = \pi_\alpha (1 - 2a + O(a^2)) = \pi_\alpha + o(a \log a), \tag{7}$$

while the log-likelihood of $\alpha$ according to $(T_1, a)$ is given by

$$\log \Pr(\alpha \mid T_1, a) = \log(\pi_\alpha (1 - 2a + O(a^2))) = \log(\pi_\alpha) - 2a + O(a^2) = \log(\pi_\alpha) + o(a \log a).$$

Thus, the total contribution to $L_{\mathcal{D}}(T_1, a)$ coming from F is

$$\sum_{\sigma \in F} \mathcal{D}[\sigma] \log(\Pr(\sigma \mid T_1, a)) = \sum_{\alpha} \mathcal{D}[\alpha] \log(\Pr(\alpha \mid T_1, a)) = H(\pi) + o(a \log a). \tag{8}$$

All sequences in $F_i$ provide a contribution of $O(a^2)$ up to log corrections,
which is also $o(a \log a)$. Similar situations obtain for sequences in $F_{i,j}$ such
that $(i, j) \notin C^*$ and for sequences in G. Let

$$F^* = \bigcup_{(i,j) \in C^*} F_{i,j}.$$

Then we have

$$\sum_{\sigma \in A^5 \setminus (F \cup F^*)} \mathcal{D}[\sigma] \log \Pr(\sigma \mid T_1, a) = o(a \log a). \tag{9}$$

If $(i, j)$ belongs to $C^*$, there are two possibilities: either $(i, j)$ is a cherry of
$T_1$ or it is a cherry of $T_2$. First, if $(i, j)$ is a cherry of $T_1$ and
$(\alpha, i, j, \beta) \in F_{i,j}$, then

$$\log(\Pr((\alpha, i, j, \beta) \mid T_1, a)) = (1 + o(1)) \log a.$$

[This follows by considering a single mutation along the internal edge that
separates the cherry from the rest of the tree.] For $(i, j) \in C^* \setminus C(T_1)$
[i.e., $(i, j) \in C(T_2)$], we have, for all $(\alpha, i, j, \beta) \in F_{i,j}$, that this character
occurs if the only mutations are on the pair of terminal edges connected to i and j
or, otherwise, if at least three mutations occurred. Hence,

$$\log(\Pr((\alpha, i, j, \beta) \mid T_1, a)) = (2 + o(1)) \log a.$$

Since $C^*$ contains two cherries from $T_1$ and two from $T_2$, we obtain

$$\begin{aligned}
\sum_{\sigma \in F^*} \mathcal{D}[\sigma] \log \Pr(\sigma \mid T_1, a)
&= \sum_{\substack{(\alpha,i,j,\beta): \\ (i,j) \in C(T_1)}} \mathcal{D}[(\alpha, i, j, \beta)] \log(\Pr((\alpha, i, j, \beta) \mid T_1, a)) \\
&\quad + \sum_{\substack{(\alpha,i,j,\beta): \\ (i,j) \in C(T_2)}} \mathcal{D}[(\alpha, i, j, \beta)] \log(\Pr((\alpha, i, j, \beta) \mid T_1, a)) \\
&= (1 + o(1)) \log a \sum_{\substack{(\alpha,i,j,\beta): \\ (i,j) \in C(T_1)}} \mathcal{D}[(\alpha, i, j, \beta)]
 + (2 + o(1)) \log a \sum_{\substack{(\alpha,i,j,\beta): \\ (i,j) \in C(T_2)}} \mathcal{D}[(\alpha, i, j, \beta)] \\
&= \Big( \frac{a}{2} + O(a^2) \Big) (1 + o(1))\, 2\, (2 \log a + \log a) \\
&= (1 + o(1))\, 3a \log a,
\end{aligned} \tag{10}$$

where the penultimate equality in (10) follows from (4). Combining (8), (9) and
(10) completes the proof of the lemma. □

Remark 15. Repeating the proof above shows that

$$L_{\mathcal{D}}(T_1, l) \ge H(\pi) + (1 + o(1))\, 3a \log a$$

if all the edge lengths l are in [a/2, 2a]. Note that since log a is negative,

$$3.1a \log a < 3a \log a.$$

Hence, for $T = T_1$ or $T = T_2$, if all the edge lengths l are in [a/2, 2a], for a
sufficiently small, we have

$$L_{\mathcal{D}}(T, l) \ge H(\pi) + 3.1a \log a. \tag{11}$$

For tree topologies different from those of $T_1$ and $T_2$, we will bound the
maximum of their expected log-likelihood. The analysis considers two cases:
either all of the edge lengths are smaller than $O(a \log(1/a))$ or there is at
least one long edge. When there is one long edge, we only need to consider the
contribution from constant characters.

Lemma 16. Let (T, l) be a tree such that at least one of the edge lengths
is greater than $4a \log(1/a)$. If a is sufficiently small, then the following holds.
For all constant characters $\alpha$,

$$\log(\Pr(\alpha \mid T, l)) \le \log(\pi_\alpha (1 + 3.9a \log a)) \tag{12}$$

and the total contribution from all constant characters is

$$\sum_{\alpha} \mathcal{D}(\alpha) \log(\Pr(\alpha \mid T, l)) \le \sum_{\alpha} \mathcal{D}(\alpha) \log(\pi_\alpha (1 + 3.9a \log a)) \le H(\pi) + 3.8a \log a. \tag{13}$$

Proof. Let T be any tree for which the sum of the edge lengths is
more than $4a \log(1/a)$. To generate the constant character $\alpha$, either the root
chooses $\alpha$ and there are no mutations, or there are at least two mutations.
Hence,

$$\Pr(\alpha \mid T, l) \le \pi_\alpha (1 - 4a \log(1/a)) + o(a \log(1/a)) = \pi_\alpha (1 + 4a \log a) + o(a \log a).$$

Hence, for a sufficiently small, (12) holds.

Recall from (7) that $\mathcal{D}[\alpha] = \pi_\alpha + o(a \log a)$. Now, we have

$$\begin{aligned}
\sum_{\alpha} \mathcal{D}(\alpha) \log(\Pr(\alpha \mid T, l))
&\le \sum_{\alpha} \mathcal{D}(\alpha) \log(\pi_\alpha (1 + 3.9a \log a)) \\
&\le \sum_{\alpha} (\pi_\alpha + o(a \log a)) (\log \pi_\alpha + 3.9a \log a) \\
&= \sum_{\alpha} \pi_\alpha \log(\pi_\alpha) + 3.9a \log a \sum_{\alpha} \pi_\alpha + o(a \log a) \\
&= H(\pi) + 3.9a \log a + o(a \log a).
\end{aligned}$$

For a sufficiently small, (13) follows. □

When we restrict our attention to trees all of whose edge lengths are 
at most 4alog(l/a), we need to consider the contribution from constant 
characters and characters where one cherry differs from the rest of the tree. 

Lemma 17. Let (T, l) be a tree all of whose edge lengths are at most
$4a \log(1/a)$ and suppose further that T has a topology different from those
of $T_1$ and $T_2$. Then the following hold for sufficiently small a:

• for all constant characters $\alpha$,

$$\log(\Pr(\alpha \mid T, l)) \le \log(\pi_\alpha); \tag{14}$$

• for all cherries $(i, j) \in C^* \setminus C(T)$ and all $\alpha \neq \beta$,

$$\log(\Pr((\alpha, i, j, \beta) \mid T, l)) \le 1.99a \log a; \tag{15}$$

• for all cherries $(i, j) \in C^* \cap C(T)$ and all $\alpha \neq \beta$,

$$\log(\Pr((\alpha, i, j, \beta) \mid T, l)) \le 0.99a \log a; \tag{16}$$

• finally, the total contribution from all constant characters and characters
with one cherry different from the rest (i.e., $F \cup F^*$) is

$$\begin{aligned}
\sum_{\sigma \in F \cup F^*} \mathcal{D}(\sigma) \log(\Pr(\sigma \mid T, l))
&\le \sum_{\alpha} \mathcal{D}(\alpha) \log \pi_\alpha
 + 1.99a \log a \sum_{\substack{(\alpha,i,j,\beta): \\ (i,j) \in C^* \setminus C(T)}} \mathcal{D}[(\alpha, i, j, \beta)] \\
&\quad + 0.99a \log a \sum_{\substack{(\alpha,i,j,\beta): \\ (i,j) \in C^* \cap C(T)}} \mathcal{D}[(\alpha, i, j, \beta)] \\
&\le H(\pi) + 3.45a \log a.
\end{aligned} \tag{17}$$

Before proving the above lemma, we make the following combinatorial
observation:

Observation 18. Let $T \neq T_1, T_2$. Then

$$|C(T) \cap C^*| \le 1.$$

The observation follows by considering a tree T that contains at least one
of the cherries of $C^*$, say (1, 2). Clearly, T cannot also contain the cherry
(1, 5) or the cherry (2, 4). And if it contains the cherry (4, 5), then $T = T_1$.

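Proof of Lemma 17. The bound (14) follows from Remark 12.

We will now bound the contribution to $L_{\mathcal{D}}(T, l)$ provided by the cherries
in $C^*$. Note that if $(i, j) \in C^*$ and $(i, j) \notin C(T)$, then for all $\alpha$ and $\beta$, to
generate $(\alpha, i, j, \beta)$, we need at least two mutations. Hence,

$$\begin{aligned}
\log(\Pr((\alpha, i, j, \beta) \mid T, l)) &\le \log(O(a^2 \log^2(1/a))) \\
&= 2(1 + o(1)) \log a \\
&\le 1.99a \log a,
\end{aligned}$$

for sufficiently small a. If $(i, j) \in C(T) \cap C^*$, then we need a mutation on the
internal edge to the cherry (or we need at least two mutations). Hence,

$$\begin{aligned}
\log(\Pr((\alpha, i, j, \beta) \mid T, l)) &\le \log(4a \log(1/a)) + O(a \log(1/a)) \\
&= (1 + o(1)) \log a \\
&\le 0.99a \log a,
\end{aligned}$$

for sufficiently small a. Combining these inequalities with (4) and
Observation 18, which says that there are at least three cherries in $C^* \setminus C(T)$, we
have

$$\begin{aligned}
\sum_{\sigma \in F \cup F^*} \mathcal{D}(\sigma) \log(\Pr(\sigma \mid T, l))
&\le \sum_{\alpha} \mathcal{D}(\alpha) \log(\Pr(\alpha \mid T, l))
 + \sum_{(\alpha,i,j,\beta)} \mathcal{D}((\alpha, i, j, \beta)) \log(\Pr((\alpha, i, j, \beta) \mid T, l)) \\
&\le \sum_{\alpha} \mathcal{D}(\alpha) \log \pi_\alpha
 + 1.99a \log a \sum_{\substack{(\alpha,i,j,\beta): \\ (i,j) \in C^* \setminus C(T)}} \mathcal{D}[(\alpha, i, j, \beta)] \\
&\quad + 0.99a \log a \sum_{\substack{(\alpha,i,j,\beta): \\ (i,j) \in C^* \cap C(T)}} \mathcal{D}[(\alpha, i, j, \beta)] \\
&\le H(\pi) + o(a \log a) + (1 + o(1)) \frac{a}{2} (3 \times 1.99 \log a + 0.99 \log a) \\
&\le H(\pi) + 3.45a \log a,
\end{aligned}$$

for sufficiently small a. □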

Definition 19. Let $D = (D_1, \dots, D_N)$ consist of N characters. We let

$$L_D(T, l) = \sum_{i} \log(\Pr(D_i \mid T, l)).$$

Using the Chernoff bound, together with the previous lemmas, we get the
following lemma:

Lemma 20. Suppose that D is drawn according to N independent samples
from the distribution $\mathcal{D}$. Then, with probability $1 - e^{-\Omega(N)}$, for all trees
(T, l) with the topology of $T_1$ or $T_2$ and edge lengths l(e) in [a/2, 2a] for all
e, we have

$$L_D(T, l) \ge (H(\pi) + 3.2a \log a) N. \tag{18}$$

Furthermore, for all trees (T, l) with topologies different from those of $T_1$
and $T_2$, we have

$$L_D(T, l) \le (H(\pi) + 3.3a \log a) N. \tag{19}$$

Proof. Note that there is a positive number $\mathcal{D}_{min}$ such that if all edge
lengths of the tree are in $[a/2, 4a \log(1/a)]$ for all e, then for each $\sigma \in A^5$, the
probability that $\sigma$ is generated according to T is at least $\mathcal{D}_{min}$. Moreover,
$\mathcal{D}_{min}$ depends only on a and the matrix Q.

Let

$$X(\sigma) = |\{ i : D_i = \sigma \}|$$

denote the number of characters generated under the mixture distribution
$\mathcal{D}$ whose assignment is $\sigma$. By the Chernoff bound (e.g., [15], Corollary 2.3),

$$\Pr(|X(\sigma) - \mathcal{D}(\sigma) N| > \varepsilon \mathcal{D}(\sigma) N) \le 2 \exp(-\varepsilon^2 \mathcal{D}(\sigma) N / 3).$$

Let

$$\delta \le \delta_1 := \frac{0.1a \log(1/a)}{|A|^5 \log(1/\mathcal{D}_{min})}. \tag{20}$$

Taking a union bound over the $|A|^5$ assignments, we have

$$\Pr(\text{for all } \sigma \in A^5,\ |X(\sigma) - \mathcal{D}(\sigma) N| \le N\delta) \ge 1 - \exp(-\Omega(N)).$$

Now, if T is $T_1$ or $T_2$ and l is such that l(e) is in [a/2, 2a] for all e, then,
with probability $1 - \exp(-\Omega(N))$, we have

$$\begin{aligned}
L_D(T, l) &= \sum_{\sigma \in A^5} X(\sigma) \log \Pr(\sigma \mid T, l) \\
&\ge \sum_{\sigma \in A^5} N(\mathcal{D}(\sigma) + \delta) \log \Pr(\sigma \mid T, l) \\
&= N L_{\mathcal{D}}(T, l) + \delta N \sum_{\sigma \in A^5} \log \Pr(\sigma \mid T, l) \\
&\ge N L_{\mathcal{D}}(T, l) + \delta N |A|^5 \log(\mathcal{D}_{min}) \\
&\ge (H(\pi) + 3.1a \log a + 0.1a \log a) N \\
&= (H(\pi) + 3.2a \log a) N,
\end{aligned}$$

the last inequality following from (11) and (20). This proves (18).

For the proof of (19), we have to consider two cases. The first case is 
where at least one of the edges of T is of length > 4alog(l/a). In this case, 
except with probability exp(— f2(n)), we have 

L D (T,1)= £ X(a)logPr((7|r,l) 

cr£A 5 

<^A>)log(7r Q (l + 3.9aloga)) by (12) 

a 

< N(V(a) - 5) log(vr Q (l + 3.9a log a)) 

a 

< N (h(tt) + 3.8a log a - \A\5 (^max log(vr a (l + 3.9a log a)) 

the last inequality following from (13). Therefore, if we take

$$\delta \le \delta_2 := \frac{0.1\,a\log(1/a)}{|A|\max_{\alpha}\bigl(-\log(\pi_\alpha(1 + 3.9\,a\log a))\bigr)},$$

then we obtain

$$L_D(T, l) \le N(H(\pi) + 3.7\,a\log a),$$

with probability $1 - \exp(-\Omega(N))$.

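The second case is where all of the edge lengths are at most $4a\log(1/a)$. We now have

$$\begin{aligned}
L_D(T, l) &= \sum_{\sigma \in A^5} X(\sigma)\log\Pr(\sigma \mid T, l) \\
&\le \sum_{\alpha} X(\alpha)\log\Pr(\alpha \mid T, l) + \sum_{(\alpha,i,j,\beta)} X(\alpha,i,j,\beta)\log\Pr((\alpha,i,j,\beta) \mid T, l) \\
&\le \sum_{\alpha} X(\alpha)\log\pi_\alpha
 + 1.99\,a\log a \sum_{\substack{(\alpha,i,j,\beta):\\ (i,j) \in C^* \setminus C(T)}} X(\alpha,i,j,\beta)
 + 0.99\,a\log a \sum_{\substack{(\alpha,i,j,\beta):\\ (i,j) \in C^* \cap C(T)}} X(\alpha,i,j,\beta) \\
&\le \sum_{\alpha} N(\mathcal{D}(\alpha) - \delta)\log\pi_\alpha
 + 1.99\,a\log a \sum_{\substack{(\alpha,i,j,\beta):\\ (i,j) \in C^* \setminus C(T)}} N(\mathcal{D}(\alpha,i,j,\beta) - \delta)
 + 0.99\,a\log a \sum_{\substack{(\alpha,i,j,\beta):\\ (i,j) \in C^* \cap C(T)}} N(\mathcal{D}(\alpha,i,j,\beta) - \delta) \\
&\le N\Bigl(H(\pi) + 3.45\,a\log a + 20\,\delta\,|A|^2\,a\log(1/a) - \delta\,|A|\max_{\alpha}\log(\pi_\alpha)\Bigr),
\end{aligned}$$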

where the second inequality follows from (14), (15) and (16) and the last 
inequality follows from (17). Therefore, choosing 

$$\delta \le \delta_3 := \min\left(\frac{0.05}{20\,|A|^2},\ \frac{0.05\,a\log(1/a)}{|A|\max_{\alpha}\bigl(-\log(\pi_\alpha)\bigr)}\right),$$

we obtain that, except with probability $\exp(-\Omega(N))$, we have

$$L_D(T, l) \le N(H(\pi) + 3.4\,a\log a),$$

as needed. Taking $\delta = \min(\delta_1, \delta_2, \delta_3)$ concludes the proof of the lemma. □

Lemma 21. Let $\varepsilon > 0$ and let $\Psi$ be an $(\varepsilon, a)$-regular prior on $T$. Then, with probability $1 - e^{-\Omega(N)}$, if $T \ne T_1, T_2$,

$$\frac{w(T)}{w(T_i)} \le \frac{1}{\varepsilon a^7}\exp(-0.1\,a\log(1/a)\,N).$$

Proof. With probability $1 - e^{-\Omega(N)}$, we have that (18) and (19) hold. Since $\Psi$ is $(\varepsilon, a)$-regular we see that

$$w(T_i) = \int \exp(L_D(T_i, l))\,\Psi(T_i, l)\,dl \ge \varepsilon a^7 \exp(H(\pi)N)\exp((3.2\,a\log a)N).$$

On the other hand,

$$w(T) = \int \exp(L_D(T, l))\,\Psi(T, l)\,dl \le \exp(H(\pi)N)\exp((3.3\,a\log a)N).$$

The claim follows. □ 
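Spelled out, the final step divides the upper bound on $w(T)$ by the lower bound on $w(T_i)$. The short derivation below uses only those two bounds, with the lower bound read as carrying the constant $\varepsilon a^7$:

```latex
\[
\frac{w(T)}{w(T_i)}
  \le \frac{\exp(H(\pi)N)\,\exp((3.3\,a\log a)N)}
           {\varepsilon a^7\,\exp(H(\pi)N)\,\exp((3.2\,a\log a)N)}
  = \frac{1}{\varepsilon a^7}\exp((0.1\,a\log a)N)
  = \frac{1}{\varepsilon a^7}\exp(-0.1\,a\log(1/a)\,N).
\]
```

The factor $\exp(H(\pi)N)$ cancels, and since $a\log a < 0$ for small $a$, the remaining exponent is negative and the ratio decays exponentially in $N$.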

To complete the proof of Theorem 1, we need the well-known fact that 
small conductance implies slow mixing. This is standard for discrete spaces; 
see, for example, [5]. Since we also consider the continuous case, we prove 
the following claim. 

Lemma 22. Consider a discrete-time Markov chain $P$ on a discrete or continuous state space with a unique stationary measure $\mu$. Assume, furthermore, that there exists a partition of the state space into three sets $A_1$, $A_2$ and $B$ such that the probability of a move from $A_2$ to $A_1$ is 0 [in the sense that $\int d\mu(x)\,\mathbf{1}(x \in A_2) \int dP(x, y)\,\mathbf{1}(y \in A_1) = 0$] and $\mu(A_1) \ge \mu(A_2)$, $\mu(B)/\mu(A_i) \le \varepsilon$ for $i = 1, 2$.

Let $\mu^t$ denote the distribution of the chain after $t$ steps, where the initial distribution $\mu^0$ is given by $\mu$, conditioned on $A_2$. Then the total variation distance between $\mu^{1/(3\varepsilon)}$ and $\mu$ is at least 1/3.

Proof. Let $t = 1/(3\varepsilon)$ and consider sequences $(x_1, \ldots, x_t)$ of trajectories of the chain, where $x_1$ is chosen according to the stationary distribution. Since each $x_i$ is distributed according to the stationary distribution, the fraction of sequences that contain an element of $B$ is, by the union bound, at most $t\varepsilon\mu(A_2) = \mu(A_2)/3$. The fraction of sequences that have their first element in $A_2$ is $\mu(A_2)$. Thus, conditioned on having $x_1 \in A_2$, the probability that $x_t \in B \cup A_1$ is at most 1/3. Since the stationary measure of $B \cup A_1$ is at least 1/2, the claim follows. □
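The bottleneck argument can be illustrated on a very small chain. The sketch below is a toy example under stated assumptions, not the phylogenetic chain of Theorem 1: it uses three states standing for $A_1$, $B$ and $A_2$, a hypothetical stationary mass `mu_B` for the bottleneck, and transition probabilities chosen so that detailed balance holds and there is no direct move between $A_2$ and $A_1$.

```python
def step(dist, P):
    """One step of the chain: row vector dist times transition matrix P."""
    n = len(dist)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def tv_distance(p, q):
    """Total variation distance between two distributions on a finite set."""
    return 0.5 * sum(abs(x - y) for x, y in zip(p, q))


def bottleneck_demo(mu_B=0.01):
    """Toy chain with states 0 = A1, 1 = B, 2 = A2 (hypothetical example).

    Returns (eps, t, tv): the ratio mu(B)/mu(A_i), the number of steps
    t = floor(1/(3*eps)), and the total variation distance to the
    stationary measure after t steps when started inside A2.
    """
    mu_A = (1.0 - mu_B) / 2.0
    mu = [mu_A, mu_B, mu_A]
    eps = mu_B / mu_A
    # Detailed balance: mu_A * p = mu_B * 0.5, so mu is stationary.
    p = 0.5 * eps
    P = [
        [1.0 - p, p, 0.0],   # A1 can only stay or enter B
        [0.5, 0.0, 0.5],     # B moves to A1 or A2 with equal probability
        [0.0, p, 1.0 - p],   # A2 can only stay or enter B (never A1 directly)
    ]
    t = int(1.0 / (3.0 * eps))
    dist = [0.0, 0.0, 1.0]   # mu conditioned on A2 (a single state here)
    for _ in range(t):
        dist = step(dist, P)
    return eps, t, tv_distance(dist, mu)


eps, t, tv = bottleneck_demo()
print("eps =", eps, "t =", t, "TV distance after t steps =", tv)
```

In this toy chain the distance after $t = \lfloor 1/(3\varepsilon) \rfloor$ steps stays above 1/3, because most of the mass has not yet crossed the bottleneck state; shrinking `mu_B` lengthens the time for which this remains true.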

Proof of Theorem 1. The proof now follows from Lemmas 21 and 22: we take the two sets corresponding to $T_1$ and $T_2$ with all edge lengths strictly between 0 and $\infty$. The absence of direct moves between these two sets follows from the observation that $T_1$ and $T_2$ are not connected by NNI, SPR or TBR transitions. □

3. Future directions. A popular program is MrBayes [13], which additionally uses what is known as Metropolis-coupled Markov chain Monte Carlo, referred to as MC$^3$ [10]. Analysis of this approach requires more detailed results and it is unclear whether our techniques can be extended this far. Some theoretical work analyzing MC$^3$ in a different context was done by Bhatnagar and Randall [1]. 

An interesting future direction would be to prove a positive result. In 
particular, is there a class of trees where we can prove fast convergence to 
the stationary distribution when the data are generated by a tree in this 
class? More generally, if the data are generated by a single tree, do the 
Markov chains always converge quickly to their stationary distribution? 